Regression?
Regression!

PSCI 8357 - STAT II

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

February 16, 2026

DiM vs. Regression


  • So far we considered the difference in means as our naive estimator of causal quantities.
  • This week we will see that we might use regression agnostically to estimate causal estimands as well.

    • this makes our life easier, especially if we would like to rely on the conditional ignorability assumption. (Why?)
  • BUT this only solves the estimation problem.

    • We still have to make assumptions to achieve causal identification!
  • Problem: Suppose we want to learn about the relationship between \(X\) and \(Y\).

    • The ideal is to learn about \(f_{YX}(\cdot)\).
    • In practice we learn about \({\mathbb{E}}[Y {\:\vert\:}X]\).

CEF

Conditional Expectation Function (CEF)

CEF

The CEF, \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\), is the expected value of \(Y_i\) given (conditional on) \(X_i\):

  • For continuous \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \int_{\mathcal{Y}} y f(y {\:\vert\:}X_i) \, dy \]

  • For discrete \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \sum_{\mathcal{Y}} y p(y {\:\vert\:}X_i) \]

  • Population-Level Function: Describes the relationship between \(Y_i\) and \(X_i\) in the population (finite population or superpopulation).
  • Functional Flexibility: Can be non-linear (!).

Decomposition of Observed Outcomes

CEF Decomposition Property

\[ Y_i = \underbrace{{\mathbb{E}}[Y_i {\:\vert\:}X_i]}_{\text{explained by $X_i$}} + \underbrace{\varepsilon_i}_{\text{unexplained}}, \]

where \({\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] = 0\) and \(\varepsilon_i\) is uncorrelated with any function of \(X_i\).

  • Intuition: The CEF isolates the systematic component of \(Y_i\) explained by \(X_i\), while \(\varepsilon_i\) captures noise.
  • To see this property recall

    \[ \begin{align*} \varepsilon_i &= Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] \quad \implies\\ {\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] &= {\mathbb{E}}[Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] {\:\vert\:}X_i] = 0 \end{align*} \]

  • also \({\mathbb{E}}[h(X_i) \varepsilon_i] = 0\). (How can we use Law of Iterated Expectations to prove this?)
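
A quick numerical check of the decomposition property, as a sketch in base R. The data-generating process, the variable names, and the choice of `h` are illustrative assumptions, not part of the lecture. With a discrete \(X_i\) the sample CEF is just the vector of group means, so the residuals are orthogonal to any function of \(X_i\) exactly in-sample.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n <- 1000
x <- sample(1:5, n, replace = TRUE)   # discrete covariate (assumed DGP)
y <- sin(x) + x^2 / 5 + rnorm(n)      # non-linear CEF plus noise

cef_hat <- ave(y, x)                  # sample CEF: mean of y within each value of x
eps_hat <- y - cef_hat                # CEF residual

h <- function(x) exp(x) - 3 * x       # any function of x (hypothetical choice)

# Residuals average to zero within every stratum of x ...
max_group_mean <- max(abs(tapply(eps_hat, x, mean)))
# ... and are therefore orthogonal to h(x)
cross_moment <- mean(h(x) * eps_hat)

c(max_group_mean = max_group_mean, cross_moment = cross_moment)
```

Both numbers are zero up to floating-point error, whatever function is plugged in for `h`.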

Best Minimal MSE Predictor

CEF Prediction Property

\[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = {\arg\!\min}_{g(X_i)} {\mathbb{E}}\left[ (Y_i - g(X_i))^2 \right], \] where \(g(X_i)\) is any function of \(X_i\).

  • Intuition: CEF is the best method for predicting \(Y_i\) in the least squares sense.
  • To see this property decompose the squared expression:

\[ \begin{align*} (Y_i - g(X_i))^2 &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] + {\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2 \\ &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)^2 + 2\left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)\left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right) \\ &\quad + \left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2. \end{align*} \]
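
To finish the argument (a sketch of the step this decomposition sets up), take expectations of both sides. The cross term vanishes by the decomposition property, because \({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\) is a function of \(X_i\):

\[ \begin{align*} {\mathbb{E}}\left[(Y_i - g(X_i))^2\right] &= {\mathbb{E}}\left[\left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)^2\right] + {\mathbb{E}}\left[\left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2\right]. \end{align*} \]

The first term does not depend on \(g\), and the second is non-negative and equals zero when \(g(X_i) = {\mathbb{E}}[Y_i {\:\vert\:}X_i]\), so the CEF minimizes the MSE.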

Discrete Case CEF


Density distributions show the spread of \(Y\) values at each discrete \(X\); black line connects the conditional means.

  • The CEF is the average line through the scatter of data points for each discrete \(X\).

Why Does CEF Matter?



  • The CEF properties we just established are important because:

    1. Decomposition: Any outcome can be split into a systematic part (explained by covariates) and noise.

    2. Optimality: The CEF is the best predictor of \(Y_i\) given \(X_i\) in the MSE sense.

  • Key insight: If we can estimate \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) or \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\) well, we can estimate differences in conditional means—which under the right assumptions are causal effects.
  • The question becomes: Can regression help us estimate the CEF?

Regression Justification

Regression?



  • The quantity \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) looks very familiar: we already used it in \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) and \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\).

  • We want to see if regression helps us estimate these quantities, especially when we want to estimate differences in means.

  • Note: There is nothing causal in \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) or \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\), so we still need identification.

    • We can for example rely on strong or conditional ignorability.

Some Regression Coefficient Properties

  • Before we move on to this, we need to recall important facts about regression coefficients

    1. The population regression coefficient vector is given by (follows directly from \({\mathbb{E}}[X_i \varepsilon_i] = 0\)) \[ \beta = {\mathbb{E}}[X_i X_i^{\prime}]^{-1} {\mathbb{E}}[X_i Y_i] \]

    2. The regression coefficient in the single-covariate case is given by (population and sample analog) \[ \beta = \frac{{\mathrm{cov}}(Y_i,X_i)}{{\mathbb{V}}(X_i)}, \quad \widehat{\beta} = \frac{\sum_{i = 1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i = 1}^{n} (X_i - \bar{X})^2} \]

    3. The regression coefficient in the multiple-covariate case is given by \[ \beta_{k} = \frac{{\mathrm{cov}}(\tilde{Y}_i,\tilde{X}_{ki})}{{\mathbb{V}}(\tilde{X}_{ki})}, \] where \(\tilde{X}_{ki}\) is the residual from regressing \(X_k\) on \(X_{-k}\), and \(\tilde{Y}_i\) is the residual from regressing \(Y_i\) on \(X_{-k}\).
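
The three facts above can be checked against `lm()` with their sample analogs. This is a minimal sketch in base R; the simulated data and all object names are assumptions made for illustration.

```r
set.seed(2)  # arbitrary seed

n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # correlated covariates (assumed DGP)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# 1. Matrix formula: sample analog of E[X X']^{-1} E[X Y]
X <- cbind(1, x1, x2)
beta_matrix <- drop(solve(crossprod(X), crossprod(X, y)))

# 2. Single covariate: cov(Y, X) / V(X)
beta_single <- cov(y, x1) / var(x1)
fit_single  <- lm(y ~ x1)

# 3. Multiple covariates: residualize x1 and y on the other regressor
x1_tilde <- resid(lm(x1 ~ x2))
y_tilde  <- resid(lm(y ~ x2))
beta_anatomy <- cov(y_tilde, x1_tilde) / var(x1_tilde)

c(lm = unname(coef(fit)["x1"]), anatomy = beta_anatomy)
```

Each hand-computed coefficient matches the corresponding `lm()` coefficient up to numerical precision.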

Justification 1: Linearity


Theorem: Linear CEF

If CEF \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) is linear in \(X_i\), then the population regression function \(X_i^{\prime} \beta\) returns exactly \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\).

  • To see this property we can

    • Use decomposition property of CEF to see \({\mathbb{E}}[ X_i (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]) ] = 0\)

    • Substitute \({\mathbb{E}}[Y_i {\:\vert\:}X_i] = X_i^{\prime} b\) and solve for \(b\)

  • How plausible is this linearity assumption in practice?
  • Always true in the simple case where \(T_i\) is a binary treatment indicator: \({\mathbb{E}}[Y_i {\:\vert\:}T_i] = \beta_0 + \beta_1 T_i\).

Binary Case CEF


Justification 2: Linear Approximation

  • What if the CEF is not linear?
  • Regression can still be used to approximate the CEF:

Regression Prediction Property

The function \(X_i' \beta\) provides the Minimal MSE linear approximation to \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\), that is:

\[ \beta = {\arg\!\min}_b {\mathbb{E}}\left[ ({\mathbb{E}}[Y_i {\:\vert\:}X_i] - X_i' b)^2 \right]. \]

  • Intuition: Even if the CEF is not linear, we can use regression to approximate it and draw substantive conclusions.
  • To see this we can decompose the squared error function minimized by OLS

\[ \begin{align*} (Y_i - X_i' b)^2 &= \left( (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]) + ({\mathbb{E}}[Y_i {\:\vert\:}X_i] - X_i' b) \right)^2 \\ &= (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i])^2 + ({\mathbb{E}}[Y_i {\:\vert\:}X_i] - X_i' b)^2 \\ &\quad + 2 (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]) ({\mathbb{E}}[Y_i {\:\vert\:}X_i] - X_i' b). \end{align*} \]

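  • The first term doesn’t involve \(b\).
  • The last term has an expectation of zero due to the CEF-decomposition property, since \({\mathbb{E}}[Y_i {\:\vert\:}X_i] - X_i' b\) is a function of \(X_i\).
  • So minimizing \({\mathbb{E}}[(Y_i - X_i' b)^2]\) over \(b\) is the same as minimizing the expectation of the middle term, \({\mathbb{E}}[({\mathbb{E}}[Y_i {\:\vert\:}X_i] - X_i' b)^2]\).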
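
The approximation property has an exact sample counterpart when \(X_i\) is discrete: OLS on the individual-level data gives the same coefficients as weighted OLS on the cell means, i.e. on the estimated CEF. The sketch below uses base R; the non-linear data-generating process and object names are assumptions for illustration only.

```r
set.seed(3)  # arbitrary seed

n <- 2000
x <- sample(0:6, n, replace = TRUE)   # discrete covariate (assumed DGP)
y <- log1p(x)^2 + rnorm(n)            # CEF is non-linear in x

# OLS on the individual-level data
b_micro <- coef(lm(y ~ x))

# OLS on the estimated CEF: one row per value of x, weighted by cell size
cells <- data.frame(
  x       = sort(unique(x)),
  cef_hat = as.numeric(tapply(y, x, mean)),
  n_cell  = as.numeric(table(x))
)
b_cef <- coef(lm(cef_hat ~ x, data = cells, weights = n_cell))

rbind(micro = b_micro, cef = b_cef)
```

The two rows coincide: fitting a line to \(Y_i\) and fitting a line to the conditional means are the same least squares problem.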

Approximation of Discrete Case CEF


What Does This All Mean?




  • When \(X_i\) is binary (think \(T_i\)), the CEF is linear, and the OLS slope is exactly the difference in means.
  • When the CEF is linear in \(X_i\), OLS recovers \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\); the slope is the (constant) change in the mean of \(Y_i\) per unit change in \(X_i\).
  • When the CEF is non-linear in \(X_i\), OLS provides the best linear approximation of \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\).

Regression and Causality

Back to Simple Binary Setup


  • Suppose \(\mathcal{T} = \{0, 1\}\)

  • Under SUTVA (no interference and consistency) POs are \(Y_{i} (1)\) and \(Y_{i} (0)\).

  • A unit-level treatment effect is, \(\tau_i = Y_{i} (1) - Y_{i} (0)\)

    • \({\mathbb{E}}[\tau_i] = {\mathbb{E}}[Y_{i} (1) - Y_{i} (0)] = \tau_{ATE}\) is the average treatment effect (ATE).
  • We observe \(X_i\), \(T_i\), and \(Y_i = T_i Y_{i} (1) + (1 - T_i )Y_{i} (0)\).

  • In this simple case OLS estimator solves the least squares problem:

    \[ (\widehat{\tau}, \widehat{\alpha}) = {\arg\!\min}_{\tau, \alpha} \sum_{i=1}^n \left(Y_i - \alpha - \tau T_i\right)^2 \]

  • The estimated coefficient \(\widehat{\tau}\) is algebraically equivalent to the difference-in-means estimator (\(\widehat{\tau}_{DiM}\)):

    \[ \widehat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \widehat{\tau}_{DiM} \]
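
The algebraic equivalence is easy to confirm numerically. A minimal sketch in base R with simulated data (the data-generating process and names are assumptions; the treatment is called `treat` because `T` is shorthand for `TRUE` in R):

```r
set.seed(4)  # arbitrary seed

n     <- 200
treat <- rbinom(n, 1, 0.4)                         # binary treatment indicator
y     <- 1 + 0.5 * treat + rnorm(n, sd = 1 + treat)

fit <- lm(y ~ treat)

ols_slope     <- unname(coef(fit)["treat"])
ols_intercept <- unname(coef(fit)["(Intercept)"])
dim_hat       <- mean(y[treat == 1]) - mean(y[treat == 0])
ybar_0        <- mean(y[treat == 0])

c(ols_slope = ols_slope, dim = dim_hat,
  ols_intercept = ols_intercept, ybar_0 = ybar_0)
```

The slope equals \(\bar{Y}_1 - \bar{Y}_0\) and the intercept equals \(\bar{Y}_0\), up to floating-point error.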

Regression Justification

  • Key assumptions: linearity and mean independence of errors. (Why do we care about the latter?)
  • Using the switching equation we can show that:

\[ \begin{align*} Y_i &= T_i Y_i(1) + (1 - T_i) Y_i(0) \\ &= Y_i(0) + T_i ( Y_i(1) - Y_i(0) ) \quad\text{($\because$ distribute)}\\ &= Y_i(0) + \tau_i T_i \quad \text{($\because$ unit treatment definition)}\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + ( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (\tau_i - \tau) \quad (\because \pm {\mathbb{E}}[Y_i(0)] + \tau T_i)\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \quad\text{($\because$ distribute)}\\ &= \alpha + \tau T_i + \eta_i \end{align*} \]

  • Linear functional form fully justified by SUTVA assumption alone:

    • Intercept: \(\alpha = {\mathbb{E}}[Y_i(0)]\) (average control outcome).
    • Slope: \(\tau = {\mathbb{E}}[Y_i(1) - Y_i(0)]\) (average treatment effect).
    • Error: deviation of control PO + treatment effect heterogeneity. What is the second interpretation?

Mean independent errors

  • The error is given by

\[ \eta_i = (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \]

  • In the regression context we would like \({\mathbb{E}}[\eta_i {\:\vert\:}T_i] = 0\). Does it hold?

\[ \begin{align*} {\mathbb{E}}[\eta_i {\:\vert\:}T_i] &= {\mathbb{E}}[(1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) {\:\vert\:}T_i] \\ &= (1 - T_i) ({\mathbb{E}}[Y_i(0) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(0)]) + T_i ({\mathbb{E}}[Y_i(1) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(1)]) \end{align*} \]

  • Does this look familiar? This is selection with respect to \(Y_i(0)\) and \(Y_i(1)\).
  • When would this be equal to zero? E.g. under random assignment (strong ignorability)
  • Randomization + consistency justify the linear model with mean-independent errors.

    • Does not imply homoskedasticity or normal errors, though!

    • Practical implication: Use heteroskedasticity-robust (HC2) standard errors for inference, e.g. via lm_robust().
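
To make the HC2 recommendation concrete, the sketch below builds the HC2 sandwich by hand in base R instead of calling `lm_robust()`, so it has no package dependency. The simulated data and object names are assumptions. It also relies on a standard result that is not stated on the slides: for a regression on a single binary treatment, the HC2 standard error of the slope coincides with the Neyman-style estimator \(\sqrt{s_1^2 / n_1 + s_0^2 / n_0}\).

```r
set.seed(5)  # arbitrary seed

n     <- 300
treat <- rbinom(n, 1, 0.3)
y     <- 1 + 0.5 * treat + rnorm(n, sd = 1 + 2 * treat)  # heteroskedastic by arm

fit <- lm(y ~ treat)
X   <- model.matrix(fit)
e   <- resid(fit)
h   <- hatvalues(fit)                       # leverage h_ii

# HC2 sandwich: (X'X)^{-1} X' diag(e^2 / (1 - h)) X (X'X)^{-1}
bread  <- solve(crossprod(X))
meat   <- crossprod(X * (e^2 / (1 - h)), X)
vc_hc2 <- bread %*% meat %*% bread
se_hc2 <- sqrt(vc_hc2[2, 2])                # second coefficient is the slope on treat

# Neyman-style standard error for the difference in means
y1 <- y[treat == 1]
y0 <- y[treat == 0]
se_neyman <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))

# Classical (homoskedastic) standard error, for comparison
se_classical <- coef(summary(fit))["treat", "Std. Error"]

c(hc2 = se_hc2, neyman = se_neyman, classical = se_classical)
```

The first two entries agree exactly, while the classical standard error generally differs when the arms have unequal variances and sizes.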

Regression with Covariates

Why Control for Covariates?


  • We just showed: under strong ignorability (random assignment), regression estimates causal effects.

  • But what if treatment is not randomly assigned?

  • Key insight: If we can identify variables \(X_i\) that explain why some units are treated and others are not, we may still recover causal effects.
  • This is the selection on observables framework:

    • Treatment assignment depends on observable characteristics \(X_i\)
    • Once we account for \(X_i\), treatment is “as good as random”
    • We can use regression to adjust for \(X_i\) and estimate causal effects
  • Question: What assumptions do we need? And how do we implement this with regression?

Conditional Ignorability


Assumption: Conditional Ignorability (CIA)

\[ \{ Y_i(0), Y_i(1) \} {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \]

Given covariates \(X_i\), treatment assignment is independent of potential outcomes.

  • Interpretation: Within groups defined by \(X_i\), treatment is effectively randomized.

    • Units with the same \(X_i\) who are treated vs. untreated are comparable.
    • Any remaining differences in outcomes reflect the causal effect of treatment.
  • This is weaker than strong ignorability: we allow \(T_i\) to depend on \(X_i\), just not on \((Y_i(0), Y_i(1))\) after conditioning on \(X_i\).
  • Critical requirement: We must observe and correctly measure all confounders \(X_i\).

Modeling Potential Outcomes with Covariates

  • To connect CIA to regression, we need to model how potential outcomes relate to covariates.
  • Assume constant treatment effects: \(\tau_i = \tau\) for all \(i\).

  • Potential outcomes follow: \(f_i(t) = \alpha + \tau t + \eta_i\)

    • \(\alpha\): baseline expected outcome
    • \(\tau\): constant causal effect of treatment
    • \(\eta_i\): individual-specific deviation (captures everything else affecting \(Y_i\))
  • The error \(\eta_i\) may depend on covariates. Decompose it as:

\[ \eta_i = X_i^{\prime} \gamma + \nu_i, \]

where \(\gamma\) captures the linear relationship between \(X_i\) and outcomes, and \(\nu_i\) is the residual variation.

  • Assumption: \({\mathbb{E}}[\eta_i {\:\vert\:}X_i] = X_i^\prime \gamma\) (linearity in covariates) \(\Rightarrow\) \({\mathbb{E}}[\nu_i {\:\vert\:}X_i] = 0\).

From Potential Outcomes to Regression

  • Substituting the error decomposition into observed outcomes:

\[ Y_i = f_i(T_i) = \alpha + \tau T_i + X_i^{\prime} \gamma + \nu_i \]

  • This looks like a regression equation. But when does OLS identify \(\tau\) causally?
  • Under CIA, \(f_i(t) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\), which implies:

\[ \nu_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \]

  • Why? Since \(X_i\) is fixed, only \(\nu_i\) varies in \(\eta_i\). CIA on POs transfers to \(\nu_i\).
  • Result: The regression error \(\nu_i\) is:

    1. Uncorrelated with \(X_i\) (by construction of \(\gamma\))
    2. Uncorrelated with \(T_i\) conditional on \(X_i\) (by CIA)
  • Therefore, OLS on \(Y_i = \alpha + \tau T_i + X_i^{\prime} \gamma + \nu_i\) yields consistent estimates.

Causal Identification with Covariates

  • Let’s verify that \(\tau\) captures the causal effect under our assumptions.
  • By CIA, conditioning on \(X_i\) removes selection bias:

\[ {\mathbb{E}}[f_i(t) {\:\vert\:}T_i = t, X_i] = {\mathbb{E}}[f_i(t) {\:\vert\:}X_i] = \alpha + \tau t + X_i^{\prime} \gamma \]

  • The causal effect of changing treatment from \((t-v)\) to \(t\):

\[ \begin{align*} {\mathbb{E}}[f_i(t) - f_i(t - v) {\:\vert\:}X_i] &= (\alpha + \tau t + X_i^{\prime} \gamma) - (\alpha + \tau (t - v) + X_i^{\prime} \gamma) \\ &= \tau v \end{align*} \]

  • Key observation: The \(X_i^{\prime} \gamma\) terms cancel out!

    • Confounding enters the model additively and is differenced away.
    • The coefficient \(\tau\) represents the causal effect of a unit change in \(T_i\).

Regression with Covariates: Takeaways


  • What we assumed:

    1. Conditional ignorability: \((Y_i(0), Y_i(1)) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\)
    2. Constant effects: \(\tau_i = \tau\) for all units
    3. Linearity: \({\mathbb{E}}[\eta_i {\:\vert\:}X_i] = X_i^\prime \gamma\)
  • What we achieved:

    • OLS regression of \(Y_i\) on \(T_i\) and \(X_i\) consistently estimates the causal effect \(\tau\)
    • Covariates “absorb” confounding, leaving only causal variation in \(T_i\)
  • Critical question: What happens if we omit important confounders from \(X_i\)?

    • If CIA fails because we missed a confounder, our estimate will be biased.
    • This leads us to omitted variable bias.

Omitted Variable Bias

Omitted Variable Bias (OVB)

  • Now suppose we erroneously omit \(X_i\), and just regress \(Y_i\) on \(T_i\) via OLS.
  • To see the omitted variable bias, look at what the coefficient on \(T_i\) estimates in this short regression, \(\frac{{\mathrm{cov}}(Y_i, T_i)}{{\mathbb{V}}(T_i)}\), assuming that the true model includes \(X_i\) (and recalling that \({\mathrm{cov}}(\nu_i, T_i) = 0\)):

\[ \begin{align*} {\mathrm{cov}}(Y_i, T_i) &= {\mathrm{cov}}(\alpha + \tau T_i + X_i' \gamma + \nu_i,\, T_i) \\ &= \tau {\mathrm{cov}}(T_i, T_i) + {\mathrm{cov}}(X_{1i} \gamma_1 + \ldots + X_{Ki} \gamma_K, T_i) \\ &= \tau {\mathbb{V}}(T_i) + \gamma_1 {\mathrm{cov}}(X_{1i}, T_i) + \ldots + \gamma_K {\mathrm{cov}}(X_{Ki}, T_i) \end{align*} \]

\[ \implies \frac{{\mathrm{cov}}(Y_i, T_i)}{{\mathbb{V}}(T_i)} = \tau + \underbrace{\gamma^{\prime} \delta}_{\text{OVB}} \]

where \(\delta\) are coefficients from regressions of \(X_1, \ldots, X_K\) on \(T_i\).

  • By the Frisch–Waugh–Lovell theorem, if we include some of \(X_i\) we will get \(\frac{{\mathrm{cov}}(\tilde{Y}_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)} = \tau + \tilde{\gamma}^{\prime}\tilde{\delta}\), where \(\tilde{\cdot}\) means residualized with respect to included terms from \(X_i\).

Omitted Variable Bias


  • OVB = \(\gamma^\prime \delta\), where

    • \(\gamma\) is the vector of effects of confounders on the outcome.
    • \(\delta\) is the vector of associations between confounders and treatment — i.e., the degree of confounder-induced imbalance in treatment assignment.
  • Same holds when we consider the case where we include some controls:

    \[ \text{OVB} = \tilde{\gamma}' \tilde{\delta}. \]

    Everything is just defined in terms of variables that have been residualized with respect to the included controls.

  • OVB = confounder impact \(\times\) imbalance (Cinelli and Hazlett 2020).
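
With a single omitted covariate, the OVB formula holds as an exact in-sample identity: the short-regression coefficient equals the long-regression coefficient plus \(\widehat{\gamma}\widehat{\delta}\). A minimal sketch in base R; the parameter values and names are assumptions for illustration.

```r
set.seed(6)  # arbitrary seed

n     <- 1000
x     <- rnorm(n)                            # confounder
treat <- 0.6 * x + rnorm(n)                  # treatment depends on x
y     <- 0.5 * treat + 0.8 * x + rnorm(n)    # true tau = 0.5 (assumed)

long  <- coef(lm(y ~ treat + x))             # includes the confounder
short <- coef(lm(y ~ treat))                 # omits it

gamma_hat <- unname(long["x"])                       # confounder impact on outcome
delta_hat <- unname(coef(lm(x ~ treat))["treat"])    # imbalance: regress x on treat

ovb           <- gamma_hat * delta_hat
short_coef    <- unname(short["treat"])
long_plus_ovb <- unname(long["treat"]) + ovb

c(short = short_coef, long_plus_ovb = long_plus_ovb, ovb = ovb)
```

Note that \(\widehat{\delta}\) comes from regressing the confounder on the treatment, not the other way around.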

Omitted Variable Bias

  • Let’s practice applying the OVB formula:

    OVB = \((X_{ki}, Y_i)\) relationships \(\times\) \((X_{ki}, T_i)\) relationships

[Figure: DAG with edges \(T \to Y\), \(X \to Y\), \(X \to T\).]

  1. Effect of democratic institutions on growth, estimated via regression of growth on democratic institutions.

  2. Effect of exposure to negative advertisements on turnout, estimated via regression of turnout on the number of ads seen.

  • Question: What is a possible omitted variable? How will this bias the estimate?

OVB: Simulate DAG Relationship

set.seed(20250127) # set seed

n <- 1000 # sample size
tau <- 0.5 # ATE
gamma <- 0.3 # effect of confounder on outcome
delta <- 0.3 # effect of confounder on treatment

# confounder (e.g., investment)
confounder <- rnorm(n, mean = 50, sd = 10)

# democratic institutions (correlated with confounder)
democracy_score <- delta * confounder + rnorm(n, mean = 0, sd = 5)

# economic growth (influenced by both the confounder and democratic institutions)
growth <- tau * democracy_score + gamma * confounder + rnorm(n, mean = 0, sd = 5)

# true regression including the confounder
model_unbiased <- lm(growth ~ democracy_score + confounder)
cat("Unbiased model error:", unname(model_unbiased$coefficients[2]) - tau, "\n")

Unbiased model error: -0.01922573

# regression ignoring the confounder
model_biased <- lm(growth ~ democracy_score)
cat("Biased model error:", unname(model_biased$coefficients[2]) - tau, "\n")

Biased model error: 0.3081032

OVB: High \(\gamma\), High \(\delta\)

OVB: High \(\gamma\), Low \(\delta\)

OVB: Low \(\gamma\), High \(\delta\)

OVB: Low \(\gamma\), Low \(\delta\)

Be Careful!




  • “Omitted variable” is a misleading term because it could suggest that you want to include any variable that is correlated with treatment and outcomes.

  • But remember that bad controls exist, e.g.

    • Common descendants of treatment and outcome (colliders).
    • Post-treatment variables, which can block causal paths when controlled for.

Two Ways to Adjust for Covariates


  • The discussion of OVB suggests that we can use regression to adjust for variables (\(X_i\)) to estimate the treatment effect (\(\tau\)) in two ways.

    1. Long regression: Include covariates \(X_i\) directly in the regression model.

    2. Residualized regression:

      1. Purge variation in \(Y_i\) due to \(X_i\) \(\rightarrow\) Regress \(Y_i\) on \(X_i\) and calculate residual outcomes: \(\tilde{Y}_i = Y_i - \widehat{Y}_i\).
      2. Purge variation in \(T_i\) due to \(X_i\) \(\rightarrow\) Regress \(T_i\) on \(X_i\) and calculate residual treatments: \(\tilde{T}_i = T_i - \widehat{T}_i\).
      3. Regress \(\tilde{Y}_i\) on \(\tilde{T}_i\).
  • Result: Coefficient on \(T_{i}\) in long regression and on \(\tilde{T}_i\) in residualized regression are identical.

Back-Door Criterion

Identification Analysis with Causal Graphs

  • An alternative, perhaps more intuitive, way to think about confounding is in terms of DAGs.
  • Suppose we want to estimate the ATE of \(T\) on \(Y\); which covariates do we need to measure?
  • Pearl develops criteria that can be read directly off the graph.
  • Before studying the criteria, we need to define some new concepts.
  • Nodes: \(T\), \(Y\), \(Z_1\), \(Z_2\), and \(Z_3\).

  • Paths: \(T \to Y\), \(T \leftarrow Z_3 \to Y\), \(T \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to Y\), etc.

  • \(Z_1\) is a parent of \(T\) and \(Z_3\).

  • \(T\) and \(Z_3\) are children of \(Z_1\).

  • \(Z_1\) is an ancestor of \(Y\).

  • \(Y\) is a descendant of \(Z_1\).

[Figure: DAG with edges \(Z_1 \to T\), \(Z_1 \to Z_3\), \(Z_2 \to Z_3\), \(Z_2 \to Y\), \(Z_3 \to T\), \(Z_3 \to Y\), \(T \to Y\).]

Back-Door vs. Causal Paths

Definition: Types of Paths

A causal (front-door) path from \(T\) to \(Y\) is a path where every arrow points away from \(T\) toward \(Y\): \(T \to \cdots \to Y\)

A back-door path from \(T\) to \(Y\) is any path that starts with an arrow into \(T\): \(T \leftarrow \cdots\)

  • Causal paths transmit the effect of \(T\) on \(Y\) — we want to keep these open!

  • Back-door paths create spurious associations (confounding) — we want to block these.

[Figure: DAG with edges \(T \to W\), \(W \to Y\), \(Z \to T\), \(Z \to Y\).]

  • Example: \(T \to W \to Y\) is a causal path. \(T \leftarrow Z \to Y\) is a back-door path.

  • Intuition: Back-door paths are “alternative explanations” for why \(T\) and \(Y\) might be correlated, even if \(T\) has no causal effect on \(Y\).

Colliders: A Key Concept


  • A collider on a path is a node where two arrows “collide” (point into it): \(\to C \leftarrow\)
  • Key insight: Colliders have special properties:

    • A path through a collider is naturally blocked — no information flows through it.
    • Conditioning on a collider opens the path! This can create spurious associations or mask a real relationship.
  • Example: \(Z_3\) is a collider on the path \(Z_1 \to Z_3 \leftarrow Z_2\).

    • Without conditioning: \(Z_1 {\mbox{$\perp\!\!\!\perp$}}Z_2\) (path blocked).
    • Conditioning on \(Z_3\): \(Z_1 {\mbox{$\centernot{\perp\!\!\!\perp}$}}Z_2 {\:\vert\:}Z_3\) (path opened!).

[Figure: DAG with edges \(Z_1 \to Z_3\), \(Z_2 \to Z_3\).]
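
The collider example can be illustrated with a small simulation. This is a sketch in base R under an assumed linear structural model with standard normal errors; nothing about the functional form comes from the slides.

```r
set.seed(7)  # arbitrary seed

n  <- 5000
z1 <- rnorm(n)
z2 <- rnorm(n)                 # independent of z1 by construction
z3 <- z1 + z2 + rnorm(n)       # collider: z1 -> z3 <- z2 (assumed linear form)

# Marginally, z1 and z2 are (close to) uncorrelated
cor_marginal <- cor(z1, z2)

# Conditioning on the collider induces a negative association
b_conditional <- unname(coef(lm(z2 ~ z1 + z3))["z1"])

c(cor_marginal = cor_marginal, b_conditional = b_conditional)
```

Under this design the population coefficient on `z1` after conditioning on `z3` is \(-0.5\), even though `z1` has no effect on `z2`.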

Colliders in Real Life

Blocking Paths

Definition: Blocked Paths

A path \(p\) is blocked by a set of nodes \(X\) if:

  1. \(p\) contains a non-collider that is in \(X\) (conditioning blocks flow), OR
  2. \(p\) contains a collider where neither the collider nor its descendants are in \(X\) (naturally blocked).
  • Intuition: Conditioning on non-colliders blocks information flow; conditioning on colliders opens it.

[Figure: DAG with edges \(Z_1 \to Z_3\), \(Z_2 \to Z_3\), \(Z_1 \to W_1\), \(Z_2 \to W_2\), \(W_1 \to T\), \(Z_3 \to T\), \(Z_3 \to Y\), \(W_2 \to Y\), \(T \to W_3\), \(W_3 \to Y\).]

  • \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \to Y\): blocked by \(\{W_1\}\) or \(\{Z_1\}\) (non-colliders).
  • \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to W_2 \to Y\): blocked by \(\emptyset\), i.e. without conditioning on anything (\(Z_3\) is a collider).

\(d\)-Separation




Definition: \(d\)-separation

A set \(X\) \(d\)-separates \(T\) and \(Y\) if \(X\) blocks all paths between \(T\) and \(Y\).

If \(X\) \(d\)-separates \(T\) and \(Y\), then \(Y {\mbox{$\perp\!\!\!\perp$}}T {\:\vert\:}X\).

  • Intuition: \(d\)-separation is a graphical criterion for conditional independence — if all paths are blocked, the variables are independent given \(X\).

The Back-Door Criterion

Theorem: The Back-Door Criterion

A set \(X\) satisfies the back-door criterion relative to \((T, Y)\) if:

  1. \(X\) blocks (\(d\)-separates) all back-door paths from \(T\) to \(Y\), and
  2. No element of \(X\) is a descendant of \(T\).
  • Why condition 1? Blocking back-door paths eliminates confounding — the spurious association between \(T\) and \(Y\).

  • Why condition 2? Descendants of \(T\) are post-treatment variables. Conditioning on them could block part of the causal effect (if they’re mediators), and/or induce collider/selection bias (if they’re affected by \(T\) and share causes with \(Y\), etc.).

  • Result: If \(X\) satisfies the back-door criterion, then it implies conditional ignorability (\(Y_i(t) {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\)):

\[ \begin{align*} {\mathbb{E}}[Y_i(t)] &= {\mathbb{E}}_X[{\mathbb{E}}[Y_i {\:\vert\:}T_i = t, X_i]] \\ \implies\ \tau_{ATE} &= {\mathbb{E}}[Y_i(1)] - {\mathbb{E}}[Y_i(0)] = {\mathbb{E}}_X[{\mathbb{E}}[Y_i {\:\vert\:}T_i = 1, X_i] - {\mathbb{E}}[Y_i {\:\vert\:}T_i = 0, X_i]] \end{align*} \]

Back-Door Criterion: Example


[Figure: DAG with edges \(Z_1 \to Z_3\), \(Z_2 \to Z_3\), \(Z_1 \to W_1\), \(Z_2 \to W_2\), \(W_1 \to T\), \(Z_3 \to T\), \(Z_3 \to Y\), \(W_2 \to Y\), \(T \to W_3\), \(W_3 \to Y\).]

  • Back-door paths from \(T\) to \(Y\):
    1. \(T \leftarrow Z_3 \to Y\)
    2. \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \to Y\)
    3. \(T \leftarrow Z_3 \leftarrow Z_2 \to W_2 \to Y\)
    4. \(T \leftarrow W_1 \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to W_2 \to Y\)
  • Example: Which sets satisfy the back-door criterion?
    • \(X = \{W_1, W_2\}\)? No: it doesn’t block path 1.
    • \(X = \{Z_3\}\)? No: it opens path 4, where \(Z_3\) is a collider.
    • \(X = \{Z_1, Z_3\}\)? Yes! It blocks all back-door paths (\(Z_1\) re-blocks path 4).
    • \(X = \{W_3\}\)? No: \(W_3\) is a descendant of \(T\).
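
The answers above can be illustrated by simulating the DAG. This sketch in base R assumes a linear structural model with unit coefficients and standard normal errors (all assumptions, as are the object names), so that the total effect of the treatment on the outcome, which runs through \(W_3\), equals 1.

```r
set.seed(8)  # arbitrary seed

n  <- 10000
z1 <- rnorm(n)
z2 <- rnorm(n)
w1 <- z1 + rnorm(n)
w2 <- z2 + rnorm(n)
z3 <- z1 + z2 + rnorm(n)
treat <- w1 + z3 + rnorm(n)
w3 <- treat + rnorm(n)
y  <- w3 + z3 + w2 + rnorm(n)    # total effect of treat on y is 1 (via w3)

# Helper (hypothetical name): coefficient on treat for a given adjustment set
coef_on_treat <- function(formula) unname(coef(lm(formula))["treat"])

est <- c(
  w1_w2    = coef_on_treat(y ~ treat + w1 + w2),       # path 1 left open
  z3       = coef_on_treat(y ~ treat + z3),            # opens path 4
  z1_z3    = coef_on_treat(y ~ treat + z1 + z3),       # satisfies back-door
  z1_z3_w3 = coef_on_treat(y ~ treat + z1 + z3 + w3)   # adds a descendant of treat
)
round(est, 2)
```

Only the adjustment set \(\{Z_1, Z_3\}\) lands near the true effect of 1. Adding \(W_3\) blocks the causal path and pushes the estimate toward zero.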

The Good, The Bad, The Ugly… Controls

Choosing Controls: A Taxonomy



  • We follow Cinelli, Forney, and Pearl (2024), who provide a systematic framework for thinking about control variables.

  • Key insight: Not all variables that are correlated with treatment and outcome should be controlled for!

  • We will classify controls as:

    1. Good controls: Block back-door paths without introducing bias
    2. Neutral controls: Neither help nor hurt identification (but may affect precision)
    3. Bad controls: Introduce bias through collider conditioning or post-treatment adjustment

Good Controls 1


  • A confounder is a common cause of the main explanatory variable, \(X\), and the outcome of interest, \(Y\).
  • In model (a) \(Z\) is a common cause of \(X\) and \(Y\). Controlling for \(Z\) blocks the back-door path.
  • In models (b) and (c) \(Z\) is not a common cause, but controlling for \(Z\) blocks the back-door path due to unobserved confounder \(U\).

[Figure: three DAGs. (a) \(Z \to X\), \(Z \to Y\), \(X \to Y\). (b) \(Z \to X\), \(U \to Z\), \(U \to Y\), \(X \to Y\). (c) \(U \to X\), \(U \to Z\), \(Z \to Y\), \(X \to Y\).]

Good Controls 2



  • Intuition: Common causes of \(X\) and any mediator \(M\) (between \(X\) and \(Y\)) also confound the effect of \(X\) on \(Y\).
  • Models (a)-(c) are analogous to the models without mediator – controlling for \(Z\) blocks the back-door path from \(X\) to \(Y\) (through \(M\)) and produces an unbiased estimate of the ATE.

[Figure: three DAGs. (a) \(Z \to X\), \(Z \to M\), \(X \to M\), \(M \to Y\). (b) \(Z \to X\), \(U \to Z\), \(U \to M\), \(X \to M\), \(M \to Y\). (c) \(U \to X\), \(U \to Z\), \(Z \to M\), \(X \to M\), \(M \to Y\).]

Neutral (?) Controls



  • Intuition: Ancestors of only \(X\), only \(Y\), or only \(M\) (mediator) do not introduce bias. Controlling for these factors reduces the variation in the respective variable that is unrelated to the variation in the other variables.
  • In model (a) reduction in variation is good! \(\rightarrow\) higher precision

  • In model (b) reduction in variation is bad! \(\rightarrow\) lower precision

  • In model (c) reduction in variation is good again! \(\rightarrow\) higher precision

[Figure: three DAGs. (a) \(Z \to Y\), \(X \to Y\). (b) \(Z \to X\), \(X \to Y\). (c) \(Z \to M\), \(X \to M\), \(M \to Y\).]

Bad Controls: Selection Bias



  • Intuition: We do not want to control for colliders or their descendants. This induces selection bias.
  • In models (a) and (b) controlling for \(Z\) opens a non-causal path through the collider \(Z\) and induces a spurious relationship between \(X\) and \(Y\).

  • In models (c) and (d) controlling for \(Z\) unblocks the back-door path \(X \leftarrow U_1 \rightarrow Z \leftarrow U_2 \rightarrow Y\).

[Figure: four DAGs. (a) \(X \to Z\), \(Y \to Z\), \(X \to Y\). (b) \(X \to Z\), \(U \to Z\), \(U \to Y\), \(X \to Y\). (c) \(U_1 \to X\), \(U_1 \to Z\), \(U_2 \to Z\), \(U_2 \to Y\), \(X \to Y\). (d) \(U_1 \to X\), \(U_1 \to Z\), \(U_2 \to Z\), \(U_2 \to Y\), \(Z \to Y\), \(X \to Y\).]

Bad Controls: Post-Treatment Bias



  • Intuition: We do not want to block the channels through which the effect goes (unless we are interested in \(CATE\)). This induces post-treatment bias.
  • In models (a) and (b) controlling for \(Z\) blocks the causal path.

  • In model (c) controlling for \(Z\) blocks part of the causal path.

  • In model (d) controlling for \(Z\) will not block the causal path or induce any bias.

[Figure: four DAGs. (a) \(X \to Z\), \(Z \to Y\). (b) \(X \to M\), \(M \to Y\), \(M \to Z\). (c) \(X \to Z\), \(Z \to Y\), \(X \to Y\). (d) \(X \to Y\), \(X \to Z\).]

Bad Controls: Post-Treatment Bias


  • To see the intuition behind post-treatment bias consider the following example

  • Suppose \(X = 0, 1\) randomly assigned, and then

    \[ \begin{align*} Z &= X + \varepsilon_Z, \\ Y &= \beta X + \gamma Z + \varepsilon_Y, \end{align*} \]

    where \(\varepsilon_Z\) and \(\varepsilon_Y\) are independent standard normal draws.

  • Substituting in \(Y\):

    \[ Y = (\beta + \gamma)X + \gamma \varepsilon_Z + \varepsilon_Y \]

  • Effect of \(X\) on \(Y\) is \(\beta + \gamma\).

  • Controlling for \(Z\), we would estimate an effect of \(\beta\).

  • The bias, \(-\gamma\), is the portion of the effect that has been “stolen away” by conditioning on \(Z\).
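
The example can be simulated directly. A minimal sketch in base R; the values chosen for \(\beta\) and \(\gamma\), the sample size, and the object names are assumptions.

```r
set.seed(9)  # arbitrary seed

n     <- 10000
beta  <- 1                               # direct effect of x on y (assumed value)
gamma <- 2                               # effect of z on y (assumed value)

x <- rbinom(n, 1, 0.5)                   # randomly assigned treatment
z <- x + rnorm(n)                        # post-treatment variable
y <- beta * x + gamma * z + rnorm(n)

total_effect   <- unname(coef(lm(y ~ x))["x"])       # approx. beta + gamma
controlled_for <- unname(coef(lm(y ~ x + z))["x"])   # approx. beta

c(total_effect = total_effect, controlled_for = controlled_for)
```

Even though \(X\) is randomized, adding the post-treatment variable \(Z\) shifts the estimate from roughly \(\beta + \gamma\) to roughly \(\beta\).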

Controls Conclusion



  • Be mindful of what controls you include in your analysis (even if it is an experiment).

  • Draw a DAG with controls you plan to include and see whether

    • You need them to block any back-door paths.
    • They might be colliders or introduce post-treatment bias.
    • Do not use a “kitchen sink” approach!
  • Also be mindful of the sizes of the effects of potential confounders. If their effects on the main independent variable and the dependent variable can be shown to be limited, the OVB is small!

Regression with Heterogeneous Treatments

What If Treatment Effects Vary?



  • Thus far we assumed constant effects (\(\tau_i = \tau\)) and linearity (\({\mathbb{E}}[\eta_i {\:\vert\:}X_i] = X'_i \gamma\)).

  • These are strong assumptions! What happens if treatment effects vary across units, e.g. with respect to \(X\)?

  • Setup with heterogeneous effects:

    • Binary treatment: \(T_i \in \{0,1\}\)
    • Potential outcomes: \(Y_{i}(0)\), \(Y_{i}(1)\)
    • Unit-level treatment effect: \(\tau_i = Y_{i}(1) - Y_{i}(0)\) (varies across \(i\)!)
    • Discrete covariate: \(X_i \in \{x_1, x_2, \ldots, x_L\}\)
  • Maintain conditional ignorability (CIA): \(\{ Y_{i}(0), Y_{i}(1) \} {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\)

  • Goal: Estimate \(\tau_{ATE} = {\mathbb{E}}[\tau_i]\) using regression.

The Target: Average Treatment Effect



  • Under CIA, the ATE can be written as a weighted average of conditional ATEs:

    \[ \begin{align*} \tau_{ATE} &= {\mathbb{E}}[\tau_i] = {\mathbb{E}}_{X} [{\mathbb{E}}[Y_i(1) - Y_i(0) {\:\vert\:}X_i]] \\ &= {\mathbb{E}}_{X} [\underbrace{{\mathbb{E}}[Y_i(1) {\:\vert\:}X_i] - {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i]}_{\tau_x}] \\ &= \sum_{x} \tau_x {\textrm{Pr}}(X_i = x), \end{align*} \]

    where \(\tau_x \equiv {\mathbb{E}}[Y_i(1) - Y_i(0) {\:\vert\:}X_i = x]\) is the CATE for stratum \(x\).

  • Key insight: The ATE averages stratum-specific effects \(\tau_x\) by their population shares \({\textrm{Pr}}(X_i = x)\).
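
The weighted-average formula suggests a plug-in (stratification) estimator: estimate each \(\tau_x\) by a within-stratum difference in means and weight by the stratum shares. The sketch below is an illustration in base R under an assumed data-generating process in which both the treatment probability and the treatment effect vary with \(X_i\). It is not the regression estimator discussed next.

```r
set.seed(10)  # arbitrary seed

n       <- 20000
pr_x    <- c(0.5, 0.3, 0.2)                       # Pr(X = x), assumed
x       <- sample(1:3, n, replace = TRUE, prob = pr_x)
p_treat <- c(0.2, 0.5, 0.8)[x]                    # treatment probability varies with x
treat   <- rbinom(n, 1, p_treat)
tau_x   <- c(1, 2, 4)                             # stratum-specific effects, assumed

y0 <- x + rnorm(n)                                # control potential outcome
y1 <- y0 + tau_x[x]                               # treated potential outcome
y  <- ifelse(treat == 1, y1, y0)                  # switching equation

# Stratum-specific differences in means and stratum shares
tau_x_hat <- tapply(y[treat == 1], x[treat == 1], mean) -
  tapply(y[treat == 0], x[treat == 0], mean)
share_x   <- as.numeric(prop.table(table(x)))

ate_hat  <- sum(tau_x_hat * share_x)              # sum over x of tau_x * Pr(X = x)
ate_true <- sum(tau_x * pr_x)
naive    <- mean(y[treat == 1]) - mean(y[treat == 0])

c(ate_true = ate_true, ate_hat = ate_hat, naive = naive)
```

The stratified estimate is close to the true ATE, while the unadjusted difference in means is far off because treated units are concentrated in strata with higher baseline outcomes and larger effects.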

Estimating with Saturated Regression



  • To flexibly control for \(X_i\), use a saturated regression with dummies for each unique value of \(X_i\) (one-way fixed effects):

    \[ Y_i = \alpha_1 \mathbb{1}[X_i = x_1] + \cdots + \alpha_L \mathbb{1}[X_i = x_L] + \tau T_i + \varepsilon_i, \]

    where \(\mathbb{1}[\cdot]\) is the indicator function. (One \(\alpha\) omitted if including intercept.)

  • Why saturated?

    • Makes no assumption about functional form of \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\)
    • Each stratum \(x\) gets its own intercept \(\alpha_x\)
    • This is the most flexible linear specification for discrete \(X_i\)
  • Question: Does \(\widehat{\tau}\) from this regression estimate \(\tau_{ATE}\)?
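A minimal sketch of what fitting such a saturated regression could look like in R. The data-generating process, the seed, and the variable names (`D` for the treatment, four strata) are assumptions chosen for illustration only.

```r
set.seed(1)  # arbitrary seed for this sketch
n <- 500
X <- sample(1:4, n, replace = TRUE)  # discrete covariate with L = 4 values
D <- rbinom(n, 1, 0.5)               # binary treatment
Y <- 1 + 0.5 * D + X^2 + rnorm(n)    # outcome is non-linear in X

# saturated in X: one dummy per value of X, no intercept
fit_dummies <- lm(Y ~ 0 + factor(X) + D)

# equivalent parameterization: intercept plus L - 1 dummies
fit_intercept <- lm(Y ~ factor(X) + D)

coef(fit_dummies)
coef(fit_intercept)["D"]
```

Both parameterizations span the same column space, so the coefficient on `D` is identical; only the interpretation of the stratum coefficients changes.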

Regression Anatomy

  • Regression anatomy can be written in two ways: \(\widehat{\tau} = \frac{{\mathrm{cov}}(\tilde{Y}_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)} = \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)}\), where \(\tilde{Y}_i\) and \(\tilde{T}_i\) denote the residuals from regressing \(Y_i\) and \(T_i\), respectively, on the other regressors.
# simulate data
n <- 1000
X <- rnorm(n)
D <- 0.5 * X + rnorm(n) # treatment is named D because T is shorthand for TRUE in R
Y <- 2 * D + 1 * X + rnorm(n)

# standard regression
standard <- coef(lm(Y ~ D + X))["D"]

# make Y tilde and D tilde
tilde_Y <- lm(Y ~ X)$residuals
tilde_D <- lm(D ~ X)$residuals

# regression anatomy
anatomy <- coef(lm(tilde_Y ~ tilde_D))["tilde_D"]

# simplified regression anatomy
anatomy_simp <- coef(lm(Y ~ tilde_D))["tilde_D"]

data.frame(
  Method = c("Standard", "Regression Anatomy", 
  "Regression Anatomy (Simplified)"),
  Coefficient = c(standard, anatomy, anatomy_simp)
) |>
  knitr::kable(digits = 3)
Method                             Coefficient
Standard                                 1.978
Regression Anatomy                       1.978
Regression Anatomy (Simplified)          1.978

Deriving What OLS Estimates

  • Define the residualized treatment: \(\tilde{T}_i \equiv T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]\)

  • Key property: \({\mathbb{E}}[\tilde{T}_i] = 0\) (residuals have mean zero)

  • Start from regression anatomy and simplify the covariance:

    \[ \begin{align*} \widehat{\tau} &= \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)} = \frac{{\mathbb{E}}[Y_i \tilde{T}_i] - {\mathbb{E}}[Y_i]\textcolor{#d65d0e}{{\mathbb{E}}[\tilde{T}_i]}}{{\mathbb{E}}[\tilde{T}_i^2]} \\ &= \frac{{\mathbb{E}}[Y_i \tilde{T}_i]}{{\mathbb{E}}[\tilde{T}_i^2]} \quad \text{($\because$ ${\mathbb{E}}[\tilde{T}_i] = 0$)} \end{align*} \]

  • Apply law of iterated expectations to the numerator:

    \[ {\mathbb{E}}[Y_i \tilde{T}_i] = {\mathbb{E}}\big[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] \tilde{T}_i\big] \]

    This works because \(\tilde{T}_i\) is a function only of \(T_i\) and \(X_i\) and is constant when \(T_i\) and \(X_i\) are fixed.

Expanding the Conditional Expectation



  • So we have: \(\widehat{\tau} = \frac{{\mathbb{E}}\big[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] \tilde{T}_i\big]}{{\mathbb{E}}[\tilde{T}_i^2]}\)

  • Expand \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\) using the switching equation: \(Y_i = T_i Y_i(1) + (1-T_i) Y_i(0)\)

    \[ \begin{align*} {\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] &= T_i {\mathbb{E}}[Y_{i}(1) {\:\vert\:}T_i, X_i] + (1-T_i) {\mathbb{E}}[Y_{i}(0) {\:\vert\:}T_i, X_i] \\ &= T_i {\mathbb{E}}[Y_{i}(1) {\:\vert\:}X_i] + (1-T_i) {\mathbb{E}}[Y_{i}(0) {\:\vert\:}X_i] \quad \text{($\because$ CIA)}\\ &= T_i \big({\mathbb{E}}[Y_{i}(1) {\:\vert\:}X_i] - {\mathbb{E}}[Y_{i}(0) {\:\vert\:}X_i]\big) + {\mathbb{E}}[Y_{i}(0) {\:\vert\:}X_i] \quad \text{($\because$ rearrange)}\\ &= T_i \tau_x + {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i] \end{align*} \]

Completing the Derivation

  • Substitute \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] = T_i \tau_x + {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i]\) into our expression:

    \[ \begin{align*} \widehat{\tau} &= \frac{{\mathbb{E}}\big[(T_i \tau_x + {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i]) \tilde{T}_i\big]}{{\mathbb{E}}[\tilde{T}_i^2]} \\ &= \frac{{\mathbb{E}}[T_i \tau_x \tilde{T}_i] + {\mathbb{E}}[{\mathbb{E}}[Y_i(0) {\:\vert\:}X_i] \tilde{T}_i]}{{\mathbb{E}}[\tilde{T}_i^2]} \quad \text{($\because$ distribute)} \\ &= \frac{{\mathbb{E}}[T_i \tau_x \tilde{T}_i]}{{\mathbb{E}}[\tilde{T}_i^2]} \quad \text{($\because$ ${\mathbb{E}}[{\mathbb{E}}[Y_i(0) {\:\vert\:}X_i] (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])] = 0$)} \\ &= \frac{{\mathbb{E}}_X[\tau_x {\mathbb{E}}[T_i \tilde{T}_i {\:\vert\:}X_i]]}{{\mathbb{E}}_X[{\mathbb{E}}[\tilde{T}_i^2 {\:\vert\:}X_i]]} \quad \text{($\because$ law of iterated expectations)} \\ &= \frac{{\mathbb{E}}_X[\tau_x {\mathbb{E}}[T_i (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]) {\:\vert\:}X_i]]}{{\mathbb{E}}_X[{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2 {\:\vert\:}X_i]]} \quad \text{($\because$ definition of $\tilde{T}_i$)} \\ &= \frac{{\mathbb{E}}_X[\tau_x {\mathbb{V}}(T_i {\:\vert\:}X_i)]}{{\mathbb{E}}_X[{\mathbb{V}}(T_i {\:\vert\:}X_i)]} \quad \text{($\because$ ${\mathbb{E}}[T_i \tilde{T}_i {\:\vert\:}X_i] = {\mathbb{V}}(T_i {\:\vert\:}X_i)$)} \end{align*} \]

What Was This?

Comparing ATE vs. OLS Estimand

  • Compare the target vs. what OLS estimates:

    \[ \tau_{ATE} = \sum_{x} \tau_x {\textrm{Pr}}(X_i = x), \]

    versus (in binary \(T_i\) case)

    \[ \widehat{\tau} \xrightarrow{p} \frac{{\mathbb{E}}_X[\tau_x {\mathbb{V}}(T_i {\:\vert\:}X_i)]}{{\mathbb{E}}_X[{\mathbb{V}}(T_i {\:\vert\:}X_i)]} = \frac{\sum_x \tau_x \textcolor{#d65d0e}{p_x(1-p_x)} {\textrm{Pr}}(X_i = x)}{\sum_x \textcolor{#d65d0e}{p_x(1-p_x)} {\textrm{Pr}}(X_i = x)} \]

    where \(p_x = {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x)\).

  • \(\widehat{\tau}\) aggregates via conditional variance weighting with respect to \(T_i\) instead of just population shares.

  • If \(\tau_x\) is constant across strata, regression recovers the ATE, but variance weighting could reduce efficiency (more uncertainty).

  • If \(T_i {\mbox{$\perp\!\!\!\perp$}}X_i\), then \(p_x(1-p_x)\) is constant across strata and cancels out, so \(\widehat{\tau}\) reduces to weighting by \({\textrm{Pr}}(X_i = x)\).
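To make the comparison concrete, the sketch below evaluates both formulas for three strata. All numbers (`tau_x`, `pr_x`, `p_x`) are hypothetical and chosen only to show how the two weighting schemes can diverge.

```r
tau_x <- c(1, 2, 4)        # hypothetical CATEs
pr_x  <- c(0.5, 0.3, 0.2)  # hypothetical Pr(X = x)
p_x   <- c(0.5, 0.2, 0.1)  # hypothetical Pr(T = 1 | X = x)

# target: population-share weighting
tau_ate <- sum(tau_x * pr_x)

# OLS estimand: conditional-variance weighting
w_x <- p_x * (1 - p_x) * pr_x
tau_ols <- sum(tau_x * w_x) / sum(w_x)

# with a constant propensity the variance term cancels
w_const <- 0.3 * (1 - 0.3) * pr_x
tau_ols_const <- sum(tau_x * w_const) / sum(w_const)

c(ATE = tau_ate, OLS = tau_ols, OLS_constant_p = tau_ols_const)
```

In this example the stratum with the largest effect has the least treatment variation, so the variance-weighted estimand falls below the ATE; with a constant propensity the two coincide.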

Truth about Regression


  • Logic carries through to continuous treatments (Angrist and Pischke 2009, 77–80; Aronow and Samii 2016).

  • Aronow and Samii (2016) show that for arbitrary \(T_i\) and \(X_i\),

    \[ \widehat{\tau} \xrightarrow{p} \frac{{\mathbb{E}}[w_i \tau_i]}{{\mathbb{E}}[w_i]}, \quad \text{where } w_i = (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2, \]

    in which case

    \[ {\mathbb{E}}[w_i {\:\vert\:}X_i] = {\mathbb{V}}(T_i {\:\vert\:}X_i). \]

  • The effective sample is weighted by \(\widehat{w}_i = (T_i - \widehat{{\mathbb{E}}}[T_i {\:\vert\:}X_i])^2\) (squared residual from regression of \(T_i\) on covariates).

  • Even with a representative sample, regression estimates may not aggregate effects in a representative manner. Regression estimates are local to an effective sample.
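One way to inspect the effective sample is to compute the estimated weights and compare each stratum's share of total weight with its nominal share. This is a sketch under assumed values: the propensities, the seed, and the names `D`, `nominal`, and `effective` are hypothetical.

```r
set.seed(2)  # arbitrary seed for this sketch
n <- 2000
X <- sample(1:3, n, replace = TRUE)
p <- c(0.5, 0.2, 0.05)[X]  # hypothetical propensity by stratum
D <- rbinom(n, 1, p)       # binary treatment

# w_hat_i: squared residual from regressing treatment on covariates
w_hat <- resid(lm(D ~ factor(X)))^2

# share of the nominal sample vs. share of the effective sample
nominal   <- as.numeric(prop.table(table(X)))
effective <- as.numeric(tapply(w_hat, X, sum)) / sum(w_hat)

data.frame(stratum = 1:3, nominal = nominal, effective = effective)
```

Strata where treatment is nearly deterministic (here stratum 3) contribute little to the effective sample even though they make up roughly a third of the data.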

Let’s Try a Simulation


set.seed(20250202) # set seed

n <- 1000 # sample size
tau_base <- 0.5
gamma <- 0.1 # effect of X on outcome

# some discrete covariate
X <- sample(x = 1:100, size = n, replace = TRUE)

# population ATE: average of tau_x = tau_base + 0.01 * x over x = 1, ..., 100
tau_total <- sum((tau_base + 0.01 * 1:100) / 100)

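# democratic institutions (treatment)
# constant assignment probability, independent of X
democracy_high <- rbinom(n, size = 1, prob = 0.5)
# assignment probability increasing in X; slope 0.004 is an assumed value
# chosen so that probabilities stay inside (0, 1) for X up to 100
democracy_high_2 <- rbinom(n, size = 1, prob = 0.5 + 0.004 * X)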

# economic growth (influenced by both the covariate X and democratic institutions)
growth <-
  (tau_base + 0.01 * X) *
  democracy_high +
  gamma * X +
  rnorm(n, mean = 0, sd = 5)

growth_2 <-
  (tau_base + 0.01 * X) *
  democracy_high_2 +
  gamma * X +
  rnorm(n, mean = 0, sd = 5)

# regression with constant assignment
bias1 <- lm(growth ~ democracy_high + factor(X))$coefficients[2] - tau_total

# regression with variable assignment
bias2 <- lm(growth_2 ~ democracy_high_2 + factor(X))$coefficients[2] - tau_total
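A single draw mixes bias with sampling noise, so one option is to repeat the simulation many times and look at the distribution of the estimation error. The sketch below does this for the variable-assignment case; the helper `sim_bias()`, the number of replications, the seed, and the propensity function `0.5 + 0.004 * X` are assumptions made for this illustration.

```r
set.seed(4)  # arbitrary seed for this sketch

# one simulated data set; returns tau_hat minus the population ATE
sim_bias <- function(n = 1000, tau_base = 0.5, gamma = 0.1) {
  X <- sample(1:100, size = n, replace = TRUE)
  tau_ate <- tau_base + 0.01 * mean(1:100)  # ATE under uniform X
  # assumed propensity, increasing in X and kept inside (0, 1)
  D <- rbinom(n, size = 1, prob = 0.5 + 0.004 * X)
  Y <- (tau_base + 0.01 * X) * D + gamma * X + rnorm(n, mean = 0, sd = 5)
  unname(coef(lm(Y ~ D + factor(X)))["D"]) - tau_ate
}

bias_draws <- replicate(200, sim_bias())
c(mean = mean(bias_draws), sd = sd(bias_draws))
```

The mean of `bias_draws` approximates the bias of the regression estimator for the ATE, while the standard deviation reflects sampling variability.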

Heterogeneous \(\tau\) and Assignment

Heterogeneous Assignment Only

Lessons

  • Regression is a useful tool for estimating causal effects and accounting for CIA:
    • Binary treatments: regression can provide consistent estimates of ATE.
    • Discrete or continuous treatments: Estimates provide a best linear approximation when relationship is non-linear.
    • Heterogeneous treatment effects: Regression weights stratum-specific effects by the conditional variance of treatment, so the estimate can be biased for the ATE and applies to the effective sample only.
  • Beware of OVB:
    • Avoid “bad controls” that may introduce post-treatment bias or inadvertently open back-door paths.
  • Steps to take:
    1. Be explicit about the assumptions you need to give regression a causal interpretation.
    2. Use DAGs and the back-door criterion to identify covariates to control for.
    3. Make sure you know how to interpret regression coefficients.
    4. Use simulations and/or Dagitty to validate model assumptions and relationships.

Appendix

Frisch-Waugh-Lovell Theorem


  • Consider a multiple regression model: \(Y_i = \alpha + \tau T_i + X_i^\prime \gamma + \nu_i\).

  • To find \(\widehat{\tau}\), the coefficient on \(T_i\), the Frisch-Waugh-Lovell Theorem states that:

    1. Regress \(Y_i\) on \(X_i\) and obtain the residuals \(\tilde{Y}_i = Y_i - X_i^\prime \widehat{\pi}_Y\).

    2. Regress \(T_i\) on \(X_i\) and obtain the residuals \(\tilde{T}_i = T_i - X_i^\prime \widehat{\pi}_T\).

    3. Regress \(\tilde{Y}_i\) on \(\tilde{T}_i\) to obtain \(\widehat{\tau}\).

    In addition, the residuals from step 3 are identical to those from the full model regression. The \(R^2\) of step 3 is the partial \(R^2\) of \(T_i\) and generally differs from the \(R^2\) of the full model.

  • Intuition:

    • The Frisch-Waugh-Lovell theorem decomposes the estimation process.
    • It adjusts \(Y_i\) and \(T_i\) for the covariates \(X_i\) separately, so \(\widehat{\tau}\) uses only the variation in \(T_i\) that is not explained by \(X_i\).
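The three steps can be checked numerically. The sketch below uses simulated data (the data-generating process, seed, and object names are assumptions for illustration) and confirms that the partialled-out regression reproduces both the coefficient on the treatment and the residuals of the full regression.

```r
set.seed(3)  # arbitrary seed for this sketch
n <- 500
X <- rnorm(n)
D <- 0.5 * X + rnorm(n)    # treatment, named D to avoid masking T (TRUE)
Y <- 2 * D + X + rnorm(n)

full <- lm(Y ~ D + X)

res_Y <- resid(lm(Y ~ X))       # step 1: residualize the outcome
res_D <- resid(lm(D ~ X))       # step 2: residualize the treatment
partial <- lm(res_Y ~ res_D)    # step 3: regress residuals on residuals

# same coefficient on the treatment
all.equal(unname(coef(full)["D"]), unname(coef(partial)["res_D"]))

# same residuals
all.equal(unname(resid(full)), unname(resid(partial)))
```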

References

Angrist, Joshua D, and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Aronow, Peter M, and Cyrus Samii. 2016. “Does Regression Produce Representative Estimates of Causal Effects?” American Journal of Political Science 60 (1): 250–67.
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2024. “A Crash Course in Good and Bad Controls.” Sociological Methods & Research 53 (3): 1071–1104.
Cinelli, Carlos, and Chad Hazlett. 2020. “Making Sense of Sensitivity: Extending Omitted Variable Bias.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1): 39–67.